# Multimodal generation

OmniGen2 is a unified multimodal model composed of a 3B vision-language model and a 4B diffusion model. It supports visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation.

BLIP is a Transformer-based image-to-text generation model that can generate natural language descriptions for input images.

Skyreels V2 DF 1.3B 540P

SkyReels V2 is the first open-source video generation model to adopt an autoregressive diffusion-forcing architecture. It supports unlimited-length film generation and achieves state-of-the-art performance among publicly available models.

Video Processing

A speech-language model that extends Qwen2.5-7B, supporting speech-text interleaved training and cross-modal generation

Transformers English

Qwen2 Vl Instuct Bpmncoder

A 4-bit quantized version of the Qwen2-VL-7B model, trained with Unsloth and the Hugging Face TRL library, with a reported 2x inference speedup

Transformers English

Swin Distilbertimbau

Brazilian Portuguese image captioning model based on Swin Transformer and DistilBERTimbau

Transformers Other

Show O W Clip Vit

Show-o is a PyTorch-based any-to-any conversion model focused on multimodal task processing.

Show-o is an any-to-any conversion model based on PyTorch, supporting input and output conversion across multiple modalities.

MGM-7B is an open-source multimodal chatbot built on Vicuna-7B-v1.5, supporting high-definition image understanding, reasoning, and generation.

A text-to-video generation model based on the AllenNLP library, capable of generating corresponding video content according to the input text description.

CheXagent is a foundation model for chest X-ray interpretation, intended to assist medical professionals in reading chest X-ray images.

Vit Roberta Fa Image Captioning Flickr30k

A Persian image captioning model based on ViT+RoBERTa architecture, specifically designed to generate Persian text descriptions from images

Image-to-Text Other

Blip Base Captioning Ft Hl Narratives

A BLIP model fine-tuned on the HL Narratives dataset to generate high-level narrative image descriptions

Transformers English

michelecafagna26

Blip Base Captioning Ft Hl Scenes

This model is an image captioning model based on the BLIP architecture, specifically fine-tuned for high-level scene descriptions.

Transformers English

michelecafagna26

This model can convert text descriptions into video content and is suitable for various creative and automated scenarios.

This is a text-to-video model that can convert the input text description into corresponding video content.

Text2video Zero Controlnet Canny Arcane

Text2Video-Zero is a zero-shot text-to-video tool; this variant supports Canny edge guidance and the Arcane style

Vit Rugpt2 Image Captioning

This is an image captioning model trained on a translated version (English-Russian) of the COCO2014 dataset, capable of generating Russian descriptions for input images.

Transformers Other

Igpt Fr Cased Base

An incrementally pre-trained French language model based on GPT-fr, with text-to-image generation capabilities

Transformers French

MolT5 is a large language model for translation between molecules and natural language, built on the T5 architecture.

Molecular Model


© 2025 AIbase